In this document, you will find the PCS workflow and code for conducting a thorough EDA of the Ames housing data. Note that each section in this document corresponds to an interesting trend and finding. We did not include every exploratory avenue and every exploratory plot we made in this document.
Following each interesting figure that we explore in this document, we conduct a PCS evaluation to demonstrate the predictability and stability of the figure's take-away message.
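As one hypothetical example of what a stability check can look like, the sketch below recomputes a correlation on bootstrap resamples of the data and reports how much it varies. This is an assumption about the general form of such a check, not the exact procedure used in this document; the `bootstrap_cor()` helper and the toy vectors are made up for illustration and are not the Ames data.

```r
# Hypothetical helper: recompute cor(x, y) on bootstrap resamples of the rows.
bootstrap_cor <- function(x, y, n_boot = 200) {
  n <- length(x)
  replicate(n_boot, {
    idx <- sample.int(n, size = n, replace = TRUE)  # resample rows with replacement
    cor(x[idx], y[idx])
  })
}

# Made-up toy data (NOT the Ames data): price rises with area, plus a wobble.
toy_area  <- seq(800, 3000, length.out = 40)
toy_price <- 50000 + 100 * toy_area + 20000 * sin(1:40)

set.seed(1)  # make the resampling reproducible
boot_cors <- bootstrap_cor(toy_area, toy_price)

# A narrow interval suggests the correlation is stable under this perturbation.
print(quantile(boot_cors, c(0.025, 0.975)))
```

If the interval were wide, or if it changed sign, the take-away message of the corresponding figure would deserve less trust.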
We examined and cleaned the Ames housing data in the file 01_cleaning.qmd. In each subsequent file that uses the cleaned version of the data, it is good practice to load in the original “raw” (uncleaned) data, and then clean it and pre-process it using the cleaning function you wrote. It is often helpful to keep a copy of the original uncleaned data in your environment too.
Note that our pre-processing steps were primarily intended to make the data compatible with the predictive algorithms. In general, the clean data (before pre-processing) is the more useful version for EDA, because it ensures that your perspective is not skewed by pre-processing steps such as imputation. It is still worth exploring the pre-processed data as well, since this is the data you will be using in your analysis. You will see us examine both datasets in this document.
library(tidyverse)
library(janitor)
library(lubridate)
library(cluster)
library(fossil)
library(superheat)
# if library(superheat) doesn't work, you might first need to run:
# library(devtools)
# install_github("rlbarter/superheat")
source("functions/prepareAmesData.R")
# list all objects (and custom functions) that exist in our environment
ls()
We can see that there is a strong correlation between gr_liv_area and the sale_price (response) variable, as well as several other area-related variables (garage_area, total_bsmt_sf), and that the year-related variables (garage_yr_blt, year_built) are highly correlated with each other.
2 Exploring the response (sale price)
Since our goal for this project is to predict sale price, let’s take a closer look at the sale price variable.
The distribution looks fairly clean, although it is skewed by a couple of particularly expensive houses.
One option that we explore in pre-processing is log-transforming the response variable. The log-transformed sale price variable is indeed a lot more symmetric (which sometimes can improve prediction performance):
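The effect of the transformation can also be checked numerically. The sketch below compares the sample skewness of a right-skewed vector before and after taking logs. The `skewness()` helper and the `toy_price` values are made up for illustration; they are not part of the Ames analysis code or data.

```r
# Hypothetical helper: sample skewness, i.e. the third central moment
# divided by the second central moment raised to the power 1.5.
skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Made-up, right-skewed "sale prices" (NOT the Ames data): most values are
# moderate and a couple are very large.
toy_price <- c(95000, 110000, 125000, 130000, 142000, 150000,
               163000, 178000, 210000, 260000, 480000, 755000)

skew_raw <- skewness(toy_price)
skew_log <- skewness(log(toy_price))

# Values closer to zero indicate a more symmetric distribution.
print(c(raw = skew_raw, log = skew_log))
```

On the real data, the same comparison would be run on the sale price column and its logarithm.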
The variables that leap out as being heavily related to sale price are gr_liv_area and the other area variables (x1st_flr_sf, total_bsmt_sf). Several of these are removed in the “simplifying” pre-processing option though.
We can also directly compute the correlation of each numeric variable with sale price and plot it as bars to quantify these observations. This time, we will look at all the pre-processed variables:
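A minimal sketch of this computation is shown here, assuming a data frame with a numeric response column. The `cor_with_response()` helper and the `toy` data frame are hypothetical and stand in for the pre-processed Ames data; the column names merely mimic the ones used in this document.

```r
# Hypothetical helper: correlation of every numeric column with the response,
# sorted from strongest to weakest absolute correlation.
cor_with_response <- function(data, response) {
  is_num <- vapply(data, is.numeric, logical(1))
  predictors <- setdiff(names(data)[is_num], response)
  cors <- vapply(
    predictors,
    function(v) cor(data[[v]], data[[response]], use = "pairwise.complete.obs"),
    numeric(1)
  )
  cor_df <- data.frame(variable = predictors, correlation = unname(cors))
  cor_df[order(-abs(cor_df$correlation)), ]
}

# Made-up toy data (NOT the Ames data).
toy <- data.frame(
  sale_price   = c(120000, 150000, 180000, 210000, 260000, 320000),
  gr_liv_area  = c(900, 1100, 1400, 1500, 1900, 2400),
  overall_qual = c(4, 5, 6, 6, 8, 9),
  yr_sold      = c(2008, 2006, 2009, 2007, 2010, 2006),
  neighborhood = c("NAmes", "NAmes", "OldTown", "Gilbert", "NridgHt", "NoRidge")
)

cor_df <- cor_with_response(toy, "sale_price")
print(cor_df)

# Build a bar chart of the correlations if ggplot2 is available;
# call print(p) to display it.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(
    cor_df,
    ggplot2::aes(x = correlation, y = reorder(variable, correlation))
  ) +
    ggplot2::geom_col() +
    ggplot2::labs(x = "Correlation with sale_price", y = NULL)
}
```

Non-numeric columns such as `neighborhood` are skipped, and `use = "pairwise.complete.obs"` drops missing values per pair, which matters when looking at data that has not been imputed.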
The overall_qual variable stands out as having a strong relationship with the sale price, but it doesn’t look like a linear relationship. Perhaps it looks more linear with the log-transformed sale price variable?
3.1 Log-transformed sale price
Let’s reproduce these plots, but with the log-transformed sale price response variable.
The linear relationships for the log-transformed sale price variable look even stronger now, and so too does the relationship between log(sale price) and overall_qual below:
It might be interesting to see how the neighborhoods compare to one another. Below we print a map of Ames to provide some context.
Figure 1: A map showing where the neighborhoods of Ames are located.
This map was taken from the Tidy Modeling with R book by Max Kuhn and Julia Silge, who also provide a predictive analysis of this dataset. The data that we have does not contain the latitude and longitude information, but they seem to have a version that does.
The center of the map, which contains no houses, corresponds to Iowa State University.
The boxplots below compare the sale price distribution across the neighborhoods.
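Boxplots like these are easier to read when the neighborhoods are ordered by their median value. The sketch below shows one way to do that in base R. The `median_by_group()` helper and the `toy_homes` data frame are hypothetical; the prices are made up and are not taken from the Ames data.

```r
# Hypothetical helper: median of `value` within each level of `group`,
# sorted from highest to lowest.
median_by_group <- function(data, value, group) {
  meds <- tapply(data[[value]], data[[group]], median, na.rm = TRUE)
  sort(meds, decreasing = TRUE)
}

# Made-up toy data (NOT the Ames data).
toy_homes <- data.frame(
  neighborhood = c("NAmes", "NAmes", "NAmes", "OldTown", "OldTown",
                   "NridgHt", "NridgHt", "NoRidge", "NoRidge"),
  sale_price   = c(140000, 150000, 155000, 110000, 125000,
                   300000, 340000, 310000, 350000)
)

price_medians <- median_by_group(toy_homes, "sale_price", "neighborhood")
print(price_medians)

# Order the factor levels by median price so the boxes appear sorted.
toy_homes$neighborhood <- factor(toy_homes$neighborhood,
                                 levels = names(price_medians))

# Draw the boxplots only in an interactive session.
if (interactive()) {
  boxplot(sale_price ~ neighborhood, data = toy_homes)
}
```

The same helper can be applied to any other numeric column, for example gr_liv_area, to compare house sizes across neighborhoods.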
Below we examine the size (gr_liv_area) for each neighborhood. It seems that not only are NoRidge and NridgHt the most expensive neighborhoods (above), they are also the neighborhoods with the largest houses: